Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations
We present novel minibatch stochastic optimization methods for empirical risk
minimization problems. The methods efficiently leverage variance-reduced
first-order and sub-sampled higher-order information to accelerate
convergence. For quadratic objectives, we prove improved iteration
complexity over the state of the art under reasonable assumptions. We also provide
empirical evidence of the advantages of our method compared to existing
approaches in the literature.
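A minimal sketch of a minibatch variance-reduced proximal method in this spirit (plain minibatch prox-SVRG for the lasso, written in NumPy): the sub-sampled higher-order information used by the paper's method is not included, and all names and parameter choices below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def minibatch_prox_svrg(X, y, lam=0.1, eta=0.1, n_epochs=20, batch=16, seed=0):
    """Minibatch prox-SVRG sketch for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)
    for _ in range(n_epochs):
        # Full gradient at the reference point (variance-reduction anchor).
        mu = X.T @ (X @ w_tilde - y) / n
        w = w_tilde.copy()
        for _ in range(n // batch):
            idx = rng.choice(n, size=batch, replace=False)
            Xb, yb = X[idx], y[idx]
            # Variance-reduced minibatch gradient.
            g = Xb.T @ (Xb @ w - yb) / batch - Xb.T @ (Xb @ w_tilde - yb) / batch + mu
            # Proximal (soft-thresholding) step.
            w = soft_threshold(w - eta * g, eta * lam)
        w_tilde = w
    return w_tilde

# Toy usage on synthetic sparse-regression data.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50); w_true[:5] = 1.0
y = X @ w_true + 0.01 * rng.standard_normal(200)
print(minibatch_prox_svrg(X, y)[:8])
```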
Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms
We consider empirical risk minimization of linear predictors with convex loss
functions. Such problems can be reformulated as convex-concave saddle point
problems, and thus are well suited to primal-dual first-order algorithms.
However, primal-dual algorithms often require explicit strongly convex
regularization in order to obtain fast linear convergence, and the required
dual proximal mapping may not admit a closed-form or efficient solution. In this
paper, we develop both batch and randomized primal-dual algorithms that can
exploit strong convexity from data adaptively and are capable of achieving
linear convergence even without regularization. We also present dual-free
variants of the adaptive primal-dual algorithms that do not require computing
the dual proximal mapping, which are especially suitable for logistic
regression.
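To illustrate the saddle-point reformulation this builds on, here is a small sketch of a standard batch primal-dual (Chambolle-Pock style) method for ridge regression with squared loss, where the dual proximal mapping has a closed form; the step sizes, the ridge example, and all names are illustrative assumptions, and the paper's adaptive and dual-free variants differ from this generic baseline.

```python
import numpy as np

def primal_dual_ridge(A, y, lam=0.1, n_iters=2000):
    """Batch primal-dual iterations for min_w (1/2n)||Aw - y||^2 + (lam/2)||w||^2,
    via the saddle point min_w max_alpha (1/n)(alpha^T A w - sum_i phi_i^*(alpha_i)) + (lam/2)||w||^2,
    with phi_i^*(b) = b^2/2 + b*y_i 6for the squared loss."""
    n, d = A.shape
    K = A / n                                  # coupling operator in the saddle point
    L = np.linalg.norm(K, 2)                   # spectral norm of K
    sigma = tau = 0.95 / L                     # step sizes with sigma * tau * L^2 < 1
    w = np.zeros(d); w_bar = w.copy(); alpha = np.zeros(n)
    for _ in range(n_iters):
        # Dual ascent step: closed-form prox of (sigma/n) * phi_i^*(.) for squared loss.
        v = alpha + sigma * (K @ w_bar)
        alpha = (v - (sigma / n) * y) / (1.0 + sigma / n)
        # Primal descent step: prox of tau * (lam/2)||.||^2 is a simple shrinkage.
        w_prev = w
        w = (w - tau * (K.T @ alpha)) / (1.0 + tau * lam)
        # Extrapolation (theta = 1).
        w_bar = 2.0 * w - w_prev
    return w

# Toy usage: compare against the closed-form ridge solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20)); y = A @ rng.standard_normal(20)
w_pd = primal_dual_ridge(A, y, lam=0.1)
w_ref = np.linalg.solve(A.T @ A / 100 + 0.1 * np.eye(20), A.T @ y / 100)
print(np.linalg.norm(w_pd - w_ref))
```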
Reducing Runtime by Recycling Samples
Contrary to the situation with stochastic gradient descent, we argue that
when using stochastic methods with variance reduction, such as SDCA, SAG or
SVRG, as well as their variants, it could be beneficial to reuse previously
used samples instead of fresh samples, even when fresh samples are available.
We demonstrate this empirically for SDCA, SAG and SVRG, studying the optimal
sample size one should use, and also uncover behavior that suggests running
SDCA for an integer number of epochs could be wasteful.
Distributed Multitask Learning
We consider the problem of distributed multi-task learning, where each
machine learns a separate, but related, task. Specifically, each machine learns
a linear predictor in high-dimensional space, where all tasks share the same
small support. We present a communication-efficient estimator based on the
debiased lasso and show that it is comparable with the optimal centralized
method.
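A rough sketch of the debias-then-aggregate pattern this describes, under strong simplifying assumptions: each machine fits a lasso locally by ISTA, applies a one-step debiasing correction using a crude plug-in precision-matrix estimate (a stand-in for the node-wise-lasso construction a real debiased lasso would use, and only sensible here because the toy data has more samples than features), and the master aggregates the communicated debiased vectors to recover the shared support.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iters=500):
    """Plain ISTA for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iters):
        u = w - step * X.T @ (X @ w - y) / n
        w = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)
    return w

def debiased_estimate(X, y, lam):
    """Local lasso fit plus a one-step debiasing correction (crude plug-in Theta)."""
    n, d = X.shape
    w_hat = ista_lasso(X, y, lam)
    Theta = np.linalg.pinv(X.T @ X / n + 1e-3 * np.eye(d))  # stand-in for a node-wise-lasso estimate
    return w_hat + Theta @ X.T @ (y - X @ w_hat) / n

# Toy usage: M machines, tasks share a small support, coefficients differ per task.
rng = np.random.default_rng(0)
M, n, d, lam = 5, 200, 100, 0.1
support = np.arange(5)
debiased = []
for _ in range(M):
    X = rng.standard_normal((n, d))
    w_m = np.zeros(d); w_m[support] = rng.uniform(0.5, 1.5, size=5)
    y = X @ w_m + 0.1 * rng.standard_normal(n)
    debiased.append(debiased_estimate(X, y, lam))   # one d-dimensional vector communicated per machine

row_norms = np.linalg.norm(np.stack(debiased), axis=0)  # feature-wise norm across tasks
print("recovered shared support:", np.sort(np.argsort(row_norms)[-5:]))
```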
Distributed Multi-Task Learning with Shared Representation
We study the problem of distributed multi-task learning with shared
representation, where each machine aims to learn a separate, but related, task
in an unknown shared low-dimensional subspace, i.e., when the predictor matrix
has low rank. We consider a setting where each task is handled by a different
machine, with samples for the task available locally on the machine, and study
communication-efficient methods for exploiting the shared structure.
Distributed Stochastic Multi-Task Learning with Graph Regularization
We propose methods for distributed graph-based multi-task learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or a diminishing stepsize in these methods would yield consensus (single-task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.
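A small sketch of the weighted-averaging mechanism described above, under illustrative assumptions (linear regression tasks and a fixed row-stochastic weight matrix with a larger self-weight): each machine takes a local gradient step and then averages its parameters with the other machines' messages. Setting the self-weight to 1/M would recover uniform averaging, i.e. consensus learning.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, d = 4, 100, 20          # machines/tasks, samples per machine, dimension

# Related tasks: a shared component plus a small task-specific perturbation.
w_shared = rng.standard_normal(d)
tasks = []
for _ in range(M):
    X = rng.standard_normal((n, d))
    w_m = w_shared + 0.2 * rng.standard_normal(d)
    tasks.append((X, X @ w_m + 0.1 * rng.standard_normal(n)))

# Row-stochastic averaging weights, skewed toward each machine's own iterate.
self_weight = 0.7
W = np.full((M, M), (1.0 - self_weight) / (M - 1))
np.fill_diagonal(W, self_weight)

eta = 0.05
params = np.zeros((M, d))
for _ in range(300):
    # Local gradient step on each machine's own task.
    grads = np.stack([X.T @ (X @ params[m] - y) / n for m, (X, y) in enumerate(tasks)])
    params = params - eta * grads
    # Weighted averaging of messages from the other machines.
    params = W @ params

for m, (X, y) in enumerate(tasks):
    print(f"machine {m}: train MSE = {np.mean((X @ params[m] - y) ** 2):.4f}")
```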
Efficient coordinate-wise leading eigenvector computation
We develop and analyze efficient "coordinate-wise" methods for finding the
leading eigenvector, where each step involves only a vector-vector product. We
establish global convergence with overall runtime guarantees that are at least
as good as those of the Lanczos method and dominate it for slowly decaying
spectra. Our methods are based on combining a shift-and-invert approach with
coordinate-wise algorithms for linear regression.
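A toy sketch of the shift-and-invert pattern referred to above, with illustrative parameter choices: outer power iterations on (sigma*I - A)^{-1}, where each inner linear solve is done by cyclic coordinate descent so that every coordinate update touches only one row of the matrix (a vector-vector product). The shift below comes from a crude Gershgorin bound rather than the careful shift estimation the paper relies on.

```python
import numpy as np

def coordinate_solve(M, b, x0, sweeps=20):
    """Cyclic coordinate descent on 0.5*x^T M x - b^T x for symmetric positive definite M.
    Each coordinate update uses a single row of M (a vector-vector product)."""
    x = x0.copy()
    for _ in range(sweeps):
        for i in range(len(b)):
            x[i] += (b[i] - M[i] @ x) / M[i, i]
    return x

def leading_eigenvector(A, outer_iters=25):
    """Shift-and-invert power iterations with coordinate-wise inner solves (a sketch)."""
    d = A.shape[0]
    sigma = 1.01 * np.abs(A).sum(axis=1).max()       # Gershgorin upper bound, so sigma > lambda_max
    M = sigma * np.eye(d) - A                         # positive definite shifted matrix
    v = np.random.default_rng(0).standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(outer_iters):
        v = coordinate_solve(M, v, v)                 # approximately (sigma*I - A)^{-1} v
        v /= np.linalg.norm(v)
    return v, v @ A @ v                               # eigenvector estimate, Rayleigh quotient

# Toy usage on a matrix with a clear spectral gap; compare with numpy's eigensolver.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = Q @ np.diag(np.concatenate([[10.0, 5.0], np.linspace(0.1, 1.0, 48)])) @ Q.T
v, lam = leading_eigenvector(A)
print(lam, np.linalg.eigvalsh(A)[-1])   # both should be close to 10
```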
Multi-Information Source Optimization
We consider Bayesian optimization of an expensive-to-evaluate black-box
objective function, where we also have access to cheaper approximations of the
objective. In general, such approximations arise in applications such as
reinforcement learning, engineering, and the natural sciences, and are subject
to an inherent, unknown bias. This model discrepancy is caused by an inadequate
internal model that deviates from reality and can vary over the domain, making
the utilization of these approximations a non-trivial task.
We present a novel algorithm that provides a rigorous mathematical treatment
of the uncertainties arising from model discrepancies and noisy observations.
Its optimization decisions rely on a value of information analysis that extends
the Knowledge Gradient factor to the setting of multiple information sources
that vary in cost: each sampling decision maximizes the predicted benefit per
unit cost.
We conduct an experimental evaluation demonstrating that the method
consistently outperforms other state-of-the-art techniques: it finds designs of
considerably higher objective value and additionally incurs less cost in the
exploration process.
Comment: Added benchmark logistic regression on MNIST/USPS, comparison to
MTBO/entropy search, and estimation of hyper-parameters.
Efficient Distributed Learning with Sparsity
We propose a novel, efficient approach for distributed sparse learning in
high dimensions, where observations are randomly partitioned across machines.
Computationally, at each round our method only requires the master machine to
solve a shifted ℓ_1-regularized M-estimation problem, and the other workers to
compute gradients. In terms of communication, the proposed approach
provably matches the estimation error bound of centralized methods within a
constant number of communication rounds (ignoring logarithmic factors). We conduct
extensive experiments on both simulated and real-world datasets, and
demonstrate encouraging performance on high-dimensional regression and
classification tasks.
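A rough sketch of one communication round in this style of method, under illustrative assumptions (least-squares loss, ISTA as the master's inner solver): the workers send their local gradients at the current iterate, and the master minimizes its own local loss plus a gradient-shift term and an ℓ_1 penalty. This is a generic shifted-ℓ_1 round, not the paper's exact estimator, and all names below are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def local_grad(X, y, w):
    """Gradient of the local least-squares loss (1/2n)||Xw - y||^2."""
    return X.T @ (X @ w - y) / X.shape[0]

def master_round(parts, w, lam, n_iters=300):
    """One round: workers communicate gradients at w; the master solves a shifted
    ell_1-regularized problem on its own shard via ISTA."""
    X0, y0 = parts[0]                                   # master's local shard
    grads = [local_grad(X, y, w) for X, y in parts]     # one vector communicated per machine
    shift = np.mean(grads, axis=0) - grads[0]           # global minus local gradient at w
    step = 1.0 / (np.linalg.norm(X0, 2) ** 2 / X0.shape[0])
    v = w.copy()
    for _ in range(n_iters):
        g = local_grad(X0, y0, v) + shift
        v = soft_threshold(v - step * g, step * lam)
    return v

# Toy usage: data randomly partitioned across 4 machines, sparse ground truth.
rng = np.random.default_rng(0)
d, n_per, machines, lam = 100, 150, 4, 0.05
w_true = np.zeros(d); w_true[:5] = 1.0
parts = []
for _ in range(machines):
    X = rng.standard_normal((n_per, d))
    parts.append((X, X @ w_true + 0.1 * rng.standard_normal(n_per)))

w = np.zeros(d)
for r in range(3):                                      # a few communication rounds
    w = master_round(parts, w, lam)
    print(f"round {r + 1}: estimation error = {np.linalg.norm(w - w_true):.3f}")
```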
Gradient Sparsification for Communication-Efficient Distributed Optimization
Modern large-scale machine learning applications require stochastic
optimization algorithms to be implemented on distributed computational
architectures. A key bottleneck is the communication overhead for exchanging
information such as stochastic gradients among different workers. In this
paper, to reduce the communication cost, we propose a convex optimization
formulation to minimize the coding length of stochastic gradients. To solve the
optimal sparsification efficiently, several simple and fast algorithms are
proposed for approximate solution, with theoretical guarantees on sparsity.
Experiments on regularized logistic regression, support vector
machines, and convolutional neural networks validate our sparsification
approaches.
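A minimal sketch of the kind of unbiased gradient sparsification this refers to, under illustrative assumptions: each coordinate is kept with probability proportional to its magnitude (capped at one) and rescaled by the inverse probability, so the sparsified gradient is unbiased in expectation. The paper instead derives the keep-probabilities from a convex coding-length formulation; this proportional rule is only a simple stand-in.

```python
import numpy as np

def sparsify(g, expected_nnz, rng):
    """Unbiased random sparsification of a gradient vector.
    Coordinate i is kept with probability p_i (proportional to |g_i|, capped at 1)
    and rescaled by 1/p_i, so E[sparsified g] = g."""
    p = np.minimum(1.0, expected_nnz * np.abs(g) / np.abs(g).sum())
    keep = rng.random(g.shape) < p
    out = np.zeros_like(g)
    out[keep] = g[keep] / p[keep]
    return out

# Toy usage: check the communication saving and unbiasedness empirically.
rng = np.random.default_rng(0)
g = rng.standard_normal(1000)
samples = np.stack([sparsify(g, expected_nnz=100, rng=rng) for _ in range(2000)])
print("avg nonzeros per message:", (samples != 0).sum(axis=1).mean())
rel_bias = np.linalg.norm(samples.mean(axis=0) - g) / np.linalg.norm(g)
print("relative bias of averaged messages (should be small):", rel_bias)
```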
- …